Abstract



Two major sources of stochasticity in the dynamics of neutral alleles result from resampling 
of finite populations (genetic drift) and the random genetic background of nearby selected 
alleles on which the neutral alleles are found (linked selection). There is now good evidence 
that linked selection plays an important role in shaping polymorphism levels in a number of 
species. One of the best investigated models of linked selection is the recurrent full sweep 
model, in which newly arisen selected alleles fix rapidly. However, the bulk of selected 
alleles that sweep into the population may not be destined for rapid fixation. Here we 
develop a general model of recurrent selective sweeps in a coalescent framework, one that 
generalizes the recurrent full sweep model to the case where selected alleles do not sweep to 
fixation. We show that in a large population, only the initial rapid increase of a selected 
allele affects the genealogy at partially linked sites, which under fairly general assumptions 
are unaffected by the subsequent fate of the selected allele. We also apply the theory to 
a simple model to investigate the impact of recurrent partial sweeps on levels of neutral 
diversity, and find that for a given reduction in diversity, the impact of recurrent partial 
sweeps on the frequency spectrum at neutral sites is determined primarily by the frequencies 
achieved by the selected alleles. Consequently, recurrent sweeps of selected alleles to low 
frequencies can have a profound effect on levels of diversity but can leave the frequency 
spectrum relatively unperturbed. In fact, the limiting coalescent model under a high rate of 
sweeps to low frequency is identical to the standard neutral model. The general model of 
selective sweeps we describe goes some way towards providing a more flexible framework to 
describe genomic patterns of diversity than is currently available. 



1 Introduction 



The high levels of genetic variation within natural populations have long fascinated
population geneticists. One school of thought holds that a substantial proportion of this
molecular polymorphism is neutral or very weakly deleterious (Kimura and Ohta, 1971; Ohta,
1973; Kimura, 1983). For neutral polymorphism, the level of genetic diversity results
from a balance between the introduction of alleles through mutation and their stochastic
loss (Kimura and Crow, 1964; Kimura, 1969; Ewens, 1972). Under the neutral theory of
molecular evolution this stochasticity is thought to result mostly from genetic drift
(Kimura, 1983), the random resampling that occurs in finite populations, an effect that
is exaggerated by fluctuating population size and large variation in reproductive success
among individuals (see Charlesworth, 2009, for a recent review). However, selection at
linked sites may provide a major source of stochasticity as the dynamics of a neutral allele
can be strongly influenced by the random genetic background on which selected alleles arise
(Maynard Smith and Haigh, 1974; Kaplan et al., 1989; Charlesworth et al., 1995; Hudson and
Kaplan, 1995b).



In many species examined to date, levels of diversity are substantially lower in regions
of low recombination, as found in multiple species of Drosophila (Aguade et al., 1989;
Begun and Aquadro, 1992; Berry et al., 1991; Shapiro et al., 2007; Begun et al., 2007),
Caenorhabditis (Cutter and Payseur, 2003; Cutter and Choi, 2010), humans (Hellmann et al.,
2008; Cai et al., 2009) and Saccharomyces cerevisiae (Cutter and Moses, 2011) but not
in all species, e.g. Arabidopsis (Nordborg et al., 2005; Wright et al., 2006). Moreover,
levels of diversity are also lower in regions that a priori are expected to have a higher rate
of functional mutations, e.g. near genes and conserved elements (McVicker et al., 2009;
Cai et al., 2009; Hernandez et al., 2011). Since the rate of neutral genetic drift is
independent of recombination rate, this positive correlation between recombination rates and
diversity offers good evidence that linked selection plays a substantial role in the fate of
alleles, especially in low recombination regions. What is still far from clear is how
different forms of linked selection contribute to this reduction, and whether linked selection can
explain the narrow observed range of genetic diversity across species with vastly different
(census) population sizes (Lewontin, 1974; Maynard Smith and Haigh, 1974).

Models of the effect of linked selection have often been divided between those that
propose the source of this linked selection to be either the purging of deleterious variation
(background selection) or the selective sweep of beneficial alleles (hitchhiking). In this paper
we explore the consequences of a generalized model of hitchhiking on patterns of neutral
diversity. We first review some of the key results of models of linked selection. Under
the background selection model, genetic diversity is continuously lost from natural
populations due to the removal of haplotypes that carry deleterious alleles (Charlesworth et al.,
1995; Hudson and Kaplan, 1995b). For strongly deleterious alleles, this continuous loss
acts primarily to increase the rate of genetic drift at markers closely linked to loci with
high deleterious mutation rates (Hudson and Kaplan, 1995a; Nordborg et al., 1996).
Therefore, this background selection model leads to a reduction in genetic diversity but no
skew in the frequency spectrum. However, a skew towards rare neutral alleles can result
if weakly deleterious mutations are incorporated into the model (Nordborg et al., 1996;
Gordo et al., 2002).



On the other end of the spectrum, Maynard Smith and Haigh (1974) proposed that
local levels of genetic diversity could be reduced by the hitchhiking effect. The hitchhiking
effect results from the fact that when an initially rare, beneficial allele sweeps rapidly to
fixation it carries with it a linked region of the haplotype on which it arose. The size of the
genomic region affected by a recent sweep is proportional to the ratio of the strength of
selection to the rate of recombination (Maynard Smith and Haigh, 1974; Kaplan et al.,
1989; Stephan et al., 1992; Barton, 1998), and so the reduction in levels of diversity is
determined by the distribution of selection coefficients and the rate of sweeps per unit of
the genetic map. Neutral alleles further away from the selected site may not be pulled all of
the way to fixation if recombination occurs during the sweep, which can lead to a transient
excess of high-frequency derived alleles an intermediate distance away from the selected site
after each sweep (Fay and Wu, 2000; Przeworski, 2002; Kim, 2006). As neutral diversity
levels slowly recover through an influx of new mutations after the sweep there is a strong
skew towards low frequency derived alleles, a pattern that persists for many generations
(Braverman et al., 1995; Przeworski, 2002; Kim, 2006). In a large population, the
rate of sweeps could be high enough that hitchhiking dominates genetic drift as the source of
stochasticity (Maynard Smith and Haigh, 1974; Kaplan et al., 1989; Gillespie, 2000),
an idea which has been termed genetic draft (Gillespie, 2000).

Support for a hitchhiking model over the standard model of background selection is found
in Drosophila, where there is a greater skew towards rare alleles at putatively neutral sites in
regions of low recombination (Andolfatto and Przeworski, 2001; Shapiro et al., 2007)
and regions surrounding amino-acid substitutions have lower levels of diversity (Andolfatto,
2007; Macpherson et al., 2007; Sattath et al., 2011). However, in humans (and other
species) there is no strong skew towards rare alleles in low recombination regions
(McVicker et al., 2009; Hernandez et al., 2011; Lohmueller et al., 2011), which combined
with other evidence (Coop et al., 2009; Hernandez et al., 2011) suggests that full sweeps
may have been rare, and that background selection may be the main mode of linked
selection, in humans and a number of other species.

Although the recurrent full sweep model has been the subject of considerable theoretical
investigation, it may actually be relatively rare for advantageous alleles to sweep rapidly all
the way to fixation. Fluctuating environments (e.g. Gillespie, 1991; Kopp and Hermisson,
2007; ?, 2009a,b) and changing genetic backgrounds may often act to prevent alleles achieving
rapid fixation within the population (see Pritchard et al. (2010) for a recent discussion).
For example, if multiple mutations affecting the adaptive phenotype segregate during the
sweep then it may be that no one of these alleles sweeps to fixation (Pennings and Hermisson,
2006a,b; Chevin and Hospital, 2008; Ralph and Coop, 2010). Multiple alleles spreading
rapidly from low frequency can lead to either a set of partial sweeps within the population,
or a soft sweep if the alleles are tightly linked. Furthermore, a similar effect can occur when
selection acts on an allele present as standing variation, if the allele is present on multiple
haplotypes when it starts to spread (Innan and Kim, 2004; Hermisson and Pennings,
2005; Przeworski et al., 2005). The fact that, under these models, no single haplotype
goes quickly to fixation acts to reduce the hitchhiking effect, and alters the effect on the
frequency spectrum.

The genome-wide effect of other modes of linked selection on patterns of diversity is
relatively unexplored. One model that has been investigated is an infinitesimal model
of directional selection, where the aggregated effect of selection over many loci can be a
substantial source of stochasticity at linked and even unlinked sites (Robertson, 1961;
Santiago and Caballero, 1995, 1998; Barton, 2000). Fluctuating selection due to varying
environments has also been shown to lead to reduced levels of diversity at linked neutral
sites (Gillespie, 1994, 1997; Barton, 2000) and simulations of specific models of
fluctuating selection have shown that the same reduction in diversity can result in a much smaller
skew in the frequency spectrum than under the hitchhiking model (Gillespie, 1994, 1997).
However, as yet no coalescent model of the effect of recurrent incomplete sweeps has been
developed.

Here is an outline of how we proceed. First, we develop a coalescent-based model of 
patterns of diversity surrounding a selected allele that sweeps into the population but not 
necessarily to fixation. We concentrate on the case of a very large population and sites that
are partially linked to this selected locus. We find that if the initial rise of the selected allele is
rapid then the coalescent process is primarily affected by this stage, and relatively insensitive 
to the subsequent dynamics of the selected allele. Using this intuition, we then develop a 
coalescent model of recurrent sweeps on patterns of neutral diversity in which selected alleles 
may only reach intermediate frequency. To test the approximations involved in the model 
we compare the results at several stages to simulations. Some of the implications of these 
results for interpretation of genome-wide diversity patterns are presented in the discussion. 



2 Results 

2.1 Coalescent framework and assumptions 



As first described by Kaplan et al. (1988) and Hudson and Kaplan (1988), patterns of
neutral diversity at a neutral locus linked to a selected locus can be modeled by conditioning
on the trajectory of the frequency of the selected allele through time, and treating the
two allelic classes as subpopulations within each of which the dynamics are neutral, with
recombination moving lineages between the two (see also Barton and Etheridge, 2004;
Barton et al., 2004). Consider a locus under selection at which a derived allele $D$ and an
ancestral allele $A$ segregate, and let the frequency of $D$ at time $t$ be denoted $X(t)$. We will
study the coalescent process at a neutral locus partially linked to our selected locus, with
recombination rate $r$ per generation between the selected and the neutral locus.

Each ancestor on a given lineage in the coalescent process carried either the D or the A allele 
at the selected locus, which we refer to as the "type" of that lineage. 

Throughout we assume that the diploid population size is large and constant over
time. For simplicity, we assume that the effective population size is $2N$ (i.e. the neutral
coalescence rate of a pair of lineages is $1/(2N)$) and that no more than two lineages coalesce
at once in the absence of a selective sweep.

Suppose at time $t$ that $k_D$ and $k_A$ of our lineages are of the derived and ancestral type
respectively. There are $NX(t)$ individuals carrying the derived allele that could be
progenitors of the $k_D$ lineages, so the instantaneous rates of coalescence of pairs of lineages within
the two allelic classes at time $t$ are

$$\binom{k_D}{2}\frac{1}{2NX(t)} \quad\text{and}\quad \binom{k_A}{2}\frac{1}{2N(1 - X(t))}, \quad\text{respectively.} \tag{1}$$

The total instantaneous rate of recombination is $(k_D + k_A)r$. If a recombination event occurs
on a lineage at time $t$, it chooses to be of type $D$ with probability $X(t)$, and chooses to be
of type $A$ otherwise.
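As a minimal sketch of the event rates just described, the following Python function evaluates the two coalescence rates of equation (1) and the total recombination rate for a given value of the derived-allele frequency. The function name `structured_rates` and the parameter values in the usage lines are illustrative assumptions, not part of the original analysis.

```python
from math import comb


def structured_rates(k_D, k_A, x, N, r):
    """Instantaneous event rates when the derived allele is at frequency x = X(t).

    k_D, k_A : numbers of lineages currently of derived / ancestral type
    N        : population size parameter, so the neutral pairwise rate is 1/(2N)
    r        : recombination rate between the selected and neutral locus
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    return {
        # equation (1): pairwise coalescence within each allelic class
        "coal_D": comb(k_D, 2) / (2 * N * x),
        "coal_A": comb(k_A, 2) / (2 * N * (1 - x)),
        # every lineage recombines at rate r
        "recomb": (k_D + k_A) * r,
    }


# Hypothetical values: a rare derived allele makes coalescence on D fast.
rates = structured_rates(k_D=3, k_A=2, x=0.01, N=10_000, r=1e-4)
print(rates)
```

With a rare derived allele the rate on the $D$ background is much larger than on the $A$ background, which is the mechanism that forces coalescence near the origin of the sweep.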

We will leave the dynamics of the selective sweeps that determine $X(t)$ fairly unspecified,
and while stochasticity may play an important role in shaping the trajectories, in examples
we usually treat $X(t)$ as nonrandom. As we want coalescences caused by a single selective
sweep to occur at more or less the same time, we require that once the selected allele is
introduced into the population it increases in frequency rapidly, and that once the allele
frequency leaves the boundary (e.g. moves above 1%), it does not return (e.g. drops below
1%) unless it does so on the way to loss (e.g. hits 0 before returning to 1%). This condition
implies that our model applies to alleles that are at least partially codominant, as fully
recessive alleles spend appreciable time, behaving stochastically, at very low frequencies,
which can lead to different coalescent dynamics at linked loci (Teshima and Przeworski,
2006; Ewing et al., 2011).



2.2 Relation to previous models 

We describe a simple approximation to the coalescent with recurrent sweeps that is inspired
by similar approximations for a model of recurrent full sweeps. The approximation postulates
two types of coalescent events: "neutral" events occurring at rate $1/2N$ between any pair
of lineages, and additional coalescent events, involving two or more lineages, due to selective
sweeps. The first class of events can occur at any time, due to random resampling of lineages.
The second class of events, the sweep-induced coalescent events, can involve more than two
lineages, as we assume that lineages forced to coalesce by a sweep do so instantaneously on
the relevant time scale. We assume that all such lineages coalesce into a single lineage, and
that the distribution of the number of such lineages is binomial, with a success probability
that is a function of the trajectory taken by the selected allele and the recombination distance
[...] sweeps (Barton, 1998; Gillespie, 2000; Kim and Stephan, 2002; Nielsen et al., 2005;
Durrett and Schweinsberg, 200[...]).

Processes with two classes of coalescent events have previously been developed to
approximate a recurrent full-sweep model (Kaplan et al., 198[...]; Gillespie, 2000; Durrett and
Schweinsberg, 2005). When the transition probabilities can be written in this binomial form, as
they also are in the recurrent full sweep models of Gillespie (2000) and Durrett and Schweinsberg
(2005), the model is called a $\Lambda$-coalescent (Pitman, 1999; Sagitov, 1999). These also
arise in neutral models where individuals have large variance in reproductive success (e.g.
Sargsyan and Wakeley, 2008; Möhle and Sagitov, 2001). As in other work, we
present this model as an approximation not in the sense of asymptotic convergence, but
rather as a simplification, which we show later is close enough to be useful. We make a
number of simplifying assumptions, and often do not make use of the most accurate
analytical forms available, in an effort to maintain an intuitive form and description of the process
obtained. In particular, Durrett and Schweinsberg (2004) showed that a coalescent
process with simultaneous multiple collisions could provide a better approximation to the
coalescent process during a sweep, a direction we do not pursue (see also Barton, 1998;
Etheridge et al., 2006).



2.3 An approximation to the coalescent process during the sweep 

Figure 1A shows an example of the relationships between different sampled individuals at
a neutral locus in a finite population undergoing recurrent selective sweeps. At the times
indicated by the lightning bolts, selected alleles sweep into the population at some locus
linked to our neutral site. All lineages descended from the original carrier of the derived
allele coalesce, nearly instantaneously on this time scale.

Figure 1B zooms in on one of these selective sweeps. The derived allele at the selected
locus ($D$) arose $\tau$ generations ago. The five surviving ancestral lineages recombine on and off
the $D$ background, whose frequency through time is shown by the dark grey shading. Just
after time 0 those lineages on the $D$ background coalesce as $X$ goes to zero (their coalescent
rate, which is proportional to $1/X$, goes to infinity). We will show that the complexity of
the process shown in Figure 1B can be approximated by a much simpler multiple merger
coalescent process suggested by Figure 1A, in which lineages coalesce "neutrally" at rate
$1/(2N)$, and furthermore, each lineage flips a coin at each selective sweep to decide which
type it is, and those that are of type $D$ merge simultaneously.

Suppose that a derived allele at the selected locus ($D$) arose $\tau$ generations ago, at time
0. The selected mutation may still segregate within the population in the present day, or
may have gone to fixation or loss sometime before the present (in which case $X(t) = 1$ or
$0$ respectively). First consider coalescences occurring very close to the origin of a selective
mutation. A lineage can be type $D$ at time 0 for one of two reasons: either it was of type $D$ in
the present day and has not yet recombined off the $D$ background, or at the first recombination
after the selected allele arose, the lineage chose to be of type $D$. The lineage of an individual
drawn at random from the present-day population is therefore of type $D$ at time 0 with
probability

$$q = X(\tau)\,e^{-r\tau} + \int_0^{\tau} r e^{-rt} X(t)\,dt. \tag{2}$$

Here the integral is over $t$, the number of generations between the origin of $D$ and the first
subsequent recombination on a lineage ($t$ is marked for the red lineage in Figure 1B). Note
that although many recombination events may have occurred, since at each recombination
event the lineage chooses a new type independently of its previous type, we need only consider
the first after the sweep. If $\tau$ is much larger than $1/r$ the first term can be ignored, so we
commonly assume that

$$q(r, X) = \int_0^{\tau} r e^{-rt} X(t)\,dt, \tag{3}$$

as the allelic state of the sample has long been forgotten. Importantly, we can see that the
dependence of $q$ on $X$ decays exponentially through time at rate $r$. Therefore, the fate of the
selected allele more than a few multiples of $1/r$ after it arose, including its presence or absence
in the present day, will have little effect on $q$. Concretely, for two trajectories labeled 1
and 2, if $X_1(s) = X_2(s)$ for all $0 \le s \le T$, then regardless of subsequent differences in the
trajectories, $|q_1 - q_2| \le e^{-rT}$.
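The insensitivity of $q$ to the later fate of the selected allele can be checked numerically. The sketch below evaluates the probability that a lineage is of type $D$ at time 0 as the sum of the no-recombination term $X(\tau)e^{-r\tau}$ and the integral of $r e^{-rt}X(t)$ over $0 \le t \le \tau$, using a midpoint rule. The three step-function trajectories and all parameter values are hypothetical choices for illustration; they agree up to time `T` and then diverge as sharply as possible.

```python
import math


def caught_probability(traj, r, tau, steps=20_000):
    """Probability that a present-day lineage is of type D at time 0.

    traj : function t -> X(t), with t measured forward from the origin of D
    r    : recombination rate; tau : generations since D arose
    The integral of r * exp(-r t) * X(t) over [0, tau] uses the midpoint rule.
    """
    dt = tau / steps
    integral = dt * sum(
        r * math.exp(-r * (j + 0.5) * dt) * traj((j + 0.5) * dt)
        for j in range(steps)
    )
    # first term: the lineage never recombined and was of type D in the present
    return traj(tau) * math.exp(-r * tau) + integral


# Illustrative values only: trajectories share X = 0.5 until time T.
r, tau, T = 0.01, 2000.0, 500.0
q_balanced = caught_probability(lambda t: 0.5, r, tau)
q_fixed = caught_probability(lambda t: 0.5 if t < T else 1.0, r, tau)
q_lost = caught_probability(lambda t: 0.5 if t < T else 0.0, r, tau)

print(q_balanced, q_fixed, q_lost, math.exp(-r * T))
```

Even though the second and third trajectories end at fixation and loss respectively, their values of $q$ differ by no more than $e^{-rT}$, in line with the bound stated above.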

[Figure 1 panels: A. Multiple mergers coalescent; B. Coalescent with trajectory]

Figure 1: (A) An example of a multiple-merger coalescent genealogy. Eight alleles
have been sampled in the present day, and we trace their lineages backwards through time,
up the page. Lightning bolts indicate the times when a selected allele has swept into the
population. At each sweep, each lineage is either descended from the original carrier of the
derived allele at the selected site (lineages marked with a black dot) or from some other
ancestor (lineages marked with a white dot). (B) Zooming in on one sweep. The
frequency of the derived allele through time, $X(t)$, is shown in dark grey. The four
surviving lineages are shown in different colors as in (A). Horizontal dotted lines depict
recombination events in the history of a lineage. A dot indicates the oldest recombination
event experienced by each of our lineages before the $D$ allele arose, and the color of the
dot indicates whether the lineage recombined onto the $D$ background (black) or onto the $A$
background (white). As we approach the time the selected allele arose, the three lineages
found on the $D$ background coalesce into a single lineage.

We can now approximate the rapid coalescence of lineages that are forced by the sweep
by assuming that all lineages descended from the original carrier of the $D$ allele coalesce
simultaneously when the selected allele appears (a "multiple merger"). The lineages will
actually coalesce at slightly different times, but the assumption that the derived allele increases
rapidly implies that this difference is small on the coalescent time scale (i.e. $o(2N)$). As each
lineage takes part in this merger independently with probability $q$, the probability that $i$ out
of $k$ surviving lineages coalesce at time 0 is

$$\binom{k}{i} q^i (1 - q)^{k-i} \quad \text{for } 2 \le i \le k, \tag{4}$$

reducing the number of lineages from $k$ to $k - i + 1$.
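The binomial merger probability of equation (4) is simple to evaluate directly. The following sketch, with hypothetical function names, computes the probability that exactly $i$ of $k$ lineages merge at a sweep and the total probability that the sweep causes any merger at all, assuming each lineage is caught independently with probability $q$.

```python
from math import comb


def merger_probability(k, i, q):
    """Equation (4): probability that exactly i of k lineages are caught."""
    if not 2 <= i <= k:
        raise ValueError("need 2 <= i <= k")
    return comb(k, i) * q**i * (1 - q) ** (k - i)


def prob_any_merger(k, q):
    """Probability that a sweep merges at least two of the k lineages."""
    return sum(merger_probability(k, i, q) for i in range(2, k + 1))


# For a pair of lineages the sweep causes coalescence with probability q**2.
print(merger_probability(2, 2, 0.3))
print(prob_any_merger(8, 0.3))
```

Sweeps that catch zero or one lineage leave the genealogy unchanged, which is why the sum starts at $i = 2$.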

This approximation assumes that each lineage makes an independent choice of whether
to recombine off the sweep, which is equivalent to assuming that the coalescences caused
by the sweep form a 'star'-like tree, with no internal edges of nonzero length. Therefore,
the approximation ignores dependencies between lineages induced by coalescent events
earlier in the sweep, and so is a poorer approximation for large numbers of lineages. More
sophisticated approximations have been developed to account for this dependency, which
improve the properties for large samples (Barton, 1998; Durrett and Schweinsberg,
2004; Etheridge et al., 2006; Pfaffelhuber et al., 2006). However, we believe this
approximation captures many of the important features.

The other component of our approximation is that at all times, all pairs of lineages coalesce
at rate $1/(2N)$ regardless of their allelic background. This approximation ignores the fact
that lineages that are currently on different backgrounds cannot coalesce and that lineages
on the same background coalesce at a higher rate (see equation (1)).

We should also note that although large changes in the allele frequency over a small
number of generations represent a large number of children descended from a smaller number
of ancestors, this will not cause rapid coalescence in a large population if the allele remains
at intermediate frequencies. Concretely, consider a short time interval from generation $t_1$
to generation $t_2$, over which interval $X(t) \gg (t_2 - t_1)/N$. The chance that any coalescence
occurs during this time interval on the derived background is small ($O((t_2 - t_1)/(N X(t)))$),
regardless of how the frequency $X$ changes. Therefore, large, sudden changes in allele
frequencies will only force coalescence on the derived background if $X(t)$ is of order $1/N$ (and
similarly for the ancestral background). For sites that are only partially linked to the selected
locus, if recombination is moving the lineages across backgrounds at a sufficiently high rate
compared to the neutral coalescent rate ($Nr \gg 1$), then two lineages in this subdivided model
coalesce at a rate close to $1/2N$ (see Hudson and Kaplan (1988), Hey (1991), Nordborg
(1997), and Barton and Etheridge (2004) for a detailed discussion). Our
approximation will therefore be worse close to the selected site, but is asymptotically correct for
large $r$.



2.3.1 A simple trajectory 

To build intuition, we first consider a simple trajectory, making further approximations to
keep the results accessible, and compare the results to full coalescent simulations. Assume
that $D$ arises $\tau$ generations ago at a site at distance $r$ from the neutral site under
consideration, rapidly sweeps to frequency $x$, and remains close to this frequency for a time much
greater than $1/r$. Under many models of directional selection, most of the time spent in
reaching $x$ is spent at low frequency, so that any recombination that occurs during this time
will likely move a lineage to the ancestral type, and so only lineages that do not recombine
during the initial sweep will coalesce. If we let $t_x$ be the time it takes for the selected allele
to sweep to $x$ and assume $r\tau \gg 1$, then a simple approximation to $q(r, X)$ is therefore (with
the subscript emphasizing dependence on $x$)

$$q(r, X) \approx q_x := x e^{-r t_x}. \tag{5}$$

If the initial increase of $D$ is driven by additive selection of strength $s$ with $Ns > 1$, then
the initial trajectory of $D$ will be logistic, and it is reasonable to take $t_x = \log(\alpha x/(1 - x))/s$,
where $\alpha$ is $2N$ or $4Ns$ depending on whether $s$ is of order 1 or $1/N$ (the latter case
corresponding to the case where the selected allele has to rapidly achieve frequency $1/(Ns)$
to escape loss by drift). Using $q_x$ to approximate the probability that a lineage is caught by
the sweep, the expected pairwise coalescent time is smaller by a factor of

$$\left(1 - q_x^2\, e^{-\tau/(2N)}\right), \tag{6}$$

which can be found by considering whether a pair of lineages coalesce before, during, or after
the sweep.
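A short numerical sketch of this single-sweep approximation follows. It takes the sweep time $t_x$ as an input rather than deriving it from a selection model, computes $q_x = x e^{-r t_x}$, and evaluates the factor by which the expected pairwise coalescent time is reduced, $1 - q_x^2 e^{-\tau/(2N)}$. The function names and the parameter values (chosen so that $r\tau \gg 1$) are assumptions made for illustration.

```python
import math


def q_x(x, r, t_x):
    """Probability that a lineage is caught by a sweep to frequency x."""
    return x * math.exp(-r * t_x)


def coalescent_time_factor(q, tau, N):
    """Factor multiplying the neutral expected pairwise coalescent time.

    With probability exp(-tau / (2N)) the pair has not coalesced neutrally
    before reaching the sweep, where both lineages are caught with prob q**2.
    """
    return 1.0 - q**2 * math.exp(-tau / (2 * N))


# Hypothetical parameters: the sweep began tau generations ago, took t_x to rise.
N, tau, t_x, x = 10_000, 1_000, 100, 0.4
for r in (5e-3, 1e-2, 5e-2):
    factor = coalescent_time_factor(q_x(x, r, t_x), tau, N)
    print(f"r = {r:g}: diversity factor = {factor:.4f}")
```

As the recombination distance grows, $q_x$ decays and the factor approaches 1, so diversity recovers to its neutral level away from the selected site.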

If rather than remaining near $x$, the selected allele continues to sweep to fixation (perhaps
it is still under selection with strength $s_2 \gg r$), then $q \approx e^{-r t_x}$ because the selected allele
has gone quickly to fixation as in a full sweep, and the only time for recombination is in the
early phase of the trajectory $t_x$. On the other hand, if the allele became strongly deleterious
($-s_2 \gg r$), then $q \approx 0$, because there is little chance of it contributing genetic material to
the population. However, if selection subsequently experienced by $D$ is weak ($|s_2| \ll r$), so
that subsequent dynamics of the selected allele are sufficiently slow, then $q$ and therefore the
coalescent process are independent of the eventual fate of the selected allele. In summary,
for $q_x$ to be a good approximation to $q(r, X)$ and for the sweep to have an appreciable effect
on the coalescent, we need $|s_2| \ll r < s$.



Comparison to simulation 

To demonstrate this, we will apply the same approximation to situations with different
long-term behaviors. We consider five different possible trajectory types. In all cases, the
initial rise of $D$ was modeled as deterministic logistic growth begun at frequency $1/2N$ and
adjusted to reach frequency $x$ after $t_x$ units of time. In the first case ("balanced"), the allele
remains thereafter at frequency $x$. In the next two cases (Figures 2A-C), after time $t_x$,
allele $D$ approaches either frequency 1 ("fixed") or frequency 0 ("lost") logistically, reaching
frequency $1 - 1/2N$ (or $1/2N$ respectively) after the next $\tau$ time units. In the last two cases
(Figures 2D-F), the allele $D$ remains at $x$ for $T$ generations, and then proceeds logistically,
in time $t_x$, either to frequency $1 - 1/2N$ ("step") or frequency $1/2N$ ("top-hat").



In each case, we used mssel (a modified version of ms (Hudson, 2002) that allows
an arbitrary trajectory, kindly supplied by Richard Hudson) to simulate genealogies for a
recombining sequence surrounding a selected locus at which a selected allele performs one
of the trajectories shown in Figure 2. The average pairwise coalescence time from these
simulations was calculated by dividing the pairwise genetic diversity by the mutation rate,
and is shown in Figure 2 at different distances from the selected locus, compared to the
quantity predicted by equation (6). Close to the selected site (e.g. for $r < 1/T$ in Figure 2E
and F) the curves diverge, since the sites represented by the blue curves see a full sweep,
reducing diversity close to the selected site, while those in the orange curves see a short-term
balanced polymorphism, and hence show a peak in polymorphism near the selected site.
As we increase recombination distance away from the selected site, the three curves are in
good agreement with the black line (equation (6)), indicating that our partial sweep model
captures the main effect on diversity.

Our simple approximation describes diversity levels well at partially linked sites over a
range of different scenarios, and works well for a wider range of parameters (results not
shown). We furthermore used equation (4) to predict the effect of this simple partial sweep
on the coalescent process of more than two lineages, and found close agreement with
further mssel simulations for various summaries of diversity such as the expected number of
segregating sites (results not shown). Overall, these results confirm that for partially linked
sites, the coalescent process is mostly determined by the initial rapid behavior of the selected
allele.



2.4 A recurrent sweep coalescent model 

We now consider patterns of diversity at a neutral locus affected by many different selected
alleles that sweep into the population at the times of a homogeneous Poisson process with
rate $\nu$. We assume that the sweep rate is low enough that sweeps do not interfere with
each other, and return to discuss this assumption later. Each sweep occurs at some distance
$r$ from the neutral locus, and as it sweeps its frequency follows some particular trajectory
$X(t)$, which together in equation (3) determine $q$, the probability that a lineage at the neutral
site is caught by the sweep. Rather than try to explicitly model randomness in these two
components, at first we will assume that each sweep independently chooses its value of $q$
from a probability distribution with density $f(q)$. This model is exactly a $\Lambda$-coalescent
with $\Lambda(dq) = q^2 \nu f(q)\,dq + \delta_0(dq)/2N$ (see Berestycki, 2009, for a recent review), but we
leave our discussion in terms of $f$ to make the results more intuitive.

Following from our assumption that each lineage is affected by a given sweep
independently with probability $q$, when there are $k$ surviving lineages, the rate at which they coalesce
to $k - i + 1$ lineages due to sweeps is

$$\nu \binom{k}{i} \int_0^1 q^i (1 - q)^{k-i} f(q)\,dq. \tag{7}$$







This follows from our assumption that sweeps occur homogeneously through time and do
not interfere with each other, and properties of marked Poisson processes. For ease of
presentation we denote

$$I_{k,i} = \int_0^1 q^i (1 - q)^{k-i} f(q)\,dq. \tag{8}$$



Figure 2: The effect of a single partial sweep. (A) Three possible trajectories followed 
by the D allele after it arises $\tau$ generations ago, described in the text: blue is "fixed", green 
is "lost", and orange is "balanced". (B) and (C) Mean pairwise coalescent time against 
recombination distance (position, $4Nr$) away from a selected site that has experienced one of 
the three types of sweeps shown in (A), with $x = 0.4$ and $0.8$ respectively. The other parameters were 
tx/2N = 6.6 X 10~^ and $\tau/2N = 0.05$. (D) Another 3 possible trajectories: green is "top-hat" 
and blue is "step". (E) and (F) Pairwise coalescent time as in (B) and (C), but using 
the trajectories shown in (D). The other parameters were tx/2N = 6.1 x 10""^, $\tau/2N = 0.1$ 
and $T/2N = 0.02$. The black line shows the approximation to the pairwise coalescent time 
of equation (6). In (E) and (F), the vertical grey line marks $r = 1/T$. 

Recall that under our model, the rate of coalescence of pairs of lineages due to genetic drift is 
$1/(2N)$, so that the rate at which the coalescent process with $k$ lineages coalesces to $k - i + 1$ 
lineages is 

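$$\lambda_{k,i} = \frac{\delta_{i,2}}{2N}\binom{k}{2} + \nu I_{k,i} \quad \text{for } 2 \le i \le k, \qquad (9)$$

where $\delta_{i,2} = 1$ if $i = 2$ and 0 otherwise. The total rate of coalescent events when there are $k$ 
lineages is therefore 

$$\lambda_k = \frac{1}{2N}\binom{k}{2} + \nu \sum_{i=2}^{k} I_{k,i} \quad \text{for } k \ge 2, \qquad (10)$$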

and conditional on a coalescent event the probability that $i$ lineages out of $k$ coalesce, 
reducing from $k$ to $k - i + 1$ lineages, is 

$$p_{k,k-i+1} = \frac{\lambda_{k,i}}{\lambda_k} = \frac{\frac{\delta_{i,2}}{2N}\binom{k}{2} + \nu I_{k,i}}{\frac{1}{2N}\binom{k}{2} + \nu \sum_{j=2}^{k} I_{k,j}}, \quad \text{for } 2 \le i \le k. \qquad (11)$$

To simulate from this coalescent process we can simulate an exponential waiting time with 
rate $\lambda_k$, pick a number of lineages to coalesce using probabilities $p_{k,k-i+1}$, and run this process 
until we have a single lineage remaining. 
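
The simulation procedure just described can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions, not the code used for the results reported here: every sweep is assumed to have the same catch probability $q$ (so $f$ is a point mass and $I_{k,i} = \binom{k}{i} q^i (1-q)^{k-i}$), and the function names and parameter values are hypothetical.

```python
import math
import random

def sweep_rates(k, q, nu, N):
    """Rates lambda_{k,i} for i = 2..k (equation 9), assuming every sweep has the same q."""
    rates = {}
    for i in range(2, k + 1):
        rate = nu * math.comb(k, i) * q**i * (1 - q) ** (k - i)  # nu * I_{k,i}
        if i == 2:
            rate += math.comb(k, 2) / (2 * N)  # pairwise coalescence due to drift
        rates[i] = rate
    return rates

def simulate_tree_length(n, q, nu, N, rng):
    """Run the lineage-counting process from n lineages down to one.

    Returns (time to the most recent common ancestor, total tree length),
    both in generations.
    """
    k, t_mrca, t_tot = n, 0.0, 0.0
    while k > 1:
        rates = sweep_rates(k, q, nu, N)
        total = sum(rates.values())        # lambda_k of equation (10)
        wait = rng.expovariate(total)      # exponential waiting time with rate lambda_k
        t_mrca += wait
        t_tot += k * wait
        sizes = list(rates)
        # choose how many lineages merge, with probabilities lambda_{k,i} / lambda_k
        i = rng.choices(sizes, weights=[rates[s] for s in sizes])[0]
        k -= i - 1                         # i lineages merge into one
    return t_mrca, t_tot

rng = random.Random(1)
N, nu, q = 10_000, 1e-4, 0.3               # hypothetical parameter values
reps = [simulate_tree_length(2, q, nu, N, rng)[0] for _ in range(20_000)]
mean_pair_time = sum(reps) / len(reps)
expected = 1 / (1 / (2 * N) + nu * q**2)   # exact mean coalescence time for a pair
print(mean_pair_time, expected)
```

For a sample of two the simulated mean can be compared directly with the reciprocal of the pairwise rate, which gives a quick sanity check before moving to larger samples.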

Note that in deriving this process we have assumed that at all times, lineages also coalesce 
at a neutral rate 1/2N. This can be justified by assuming that recombination moves lineages 
between backgrounds at a high enough rate to allow the effects of the partitioning of the 
population by segregating alleles to be ignored. Therefore, the approximation will break 
down if a typical neutral site, at any given time, is close enough (e.g. within an $r$ of order 
$1/N$) to an allele maintained at intermediate frequency by long-term balancing selection 
(e.g. alleles maintained for time scales of order $N$). Further work is needed to refine the 
coalescent under those conditions, but our approximations should be suitable for a broad 
range of scenarios and genomic regions. 



2.5 The coalescent process with homogeneous sweeps 

It is natural to examine the case in which selective sweeps occur at uniform rate along 
a sequence of total length $L$. We assume that this sequence recombines at rate $r_{BP}$ per 
base each generation, and that sweeps enter the population at a rate $\nu_{BP}$ per base each 
generation, so that the total rate of sweeps is $\nu = \nu_{BP} L$. We also assume that the sweeps 
are homogeneous, i.e. the trajectory followed by the frequency of the derived allele, $X$, is 
independent of the distance between our neutral site and the site at which a sweep occurs. 

We will consider sweeps occurring along a very long chromosome and so will take $L \to \infty$, 
but then the total rate of sweeps, $\nu = \nu_{BP} L$, also goes to infinity. To obtain a meaningful 
limit, we need that as $L \to \infty$ the rate of sweeps corresponding to each nonzero value of $q$ 
converges to a finite value. Recall from (3) that the probability a lineage is caught up in a given 
sweep depends on the distance to the sweep (which is $r_{BP}\ell$ for a site $\ell$ bases away) and the 
trajectory $X$ taken by the sweep, and is given by $q(r_{BP}\ell, X) = r_{BP}\ell \int_0^\infty \exp(-r_{BP}\ell t) X(t)\,dt$. 
In a finite genome of length $L$, the probability distribution on values of $q$ has density $f(q) = 
h_L(q)/L$, where $h_L(q)\,dq = \int_0^L P_X\{q(r_{BP}\ell, X) \in dq\}\,d\ell$. Here $h_L(q)$ is the rate at which 
selective sweeps appear at location $r_{BP}\ell$ and whose trajectory $X$ gives $q(r_{BP}\ell, X) = q$, integrated 
across the genome; and $f(q)$ is $h_L(q)$ normalized to integrate to 1, since $\int h_L(q)\,dq = L$. 
The functions $h_L$ converge for $q > 0$ as $L$ becomes large as long as the probability that 
distant sweeps affect the focal site decays quickly enough. We therefore assume that $h_L(q)$ 
converges to a finite limit $h(q)$, i.e. that the following exists: 

$$h(q) = \lim_{L \to \infty} L f(q) \quad \text{for } 0 < q \le 1. \qquad (12)$$

This means that although the total rate of sweeps per generation is infinite, only a finite 
number happen close enough to our neutral site to potentially affect our coalescent process. 
Therefore, the rate at which $k$ lineages coalesce down to $k - i + 1$ due to sweeps converges: 

$$\nu_{BP} L\, I_{k,i} \to \nu_{BP} \binom{k}{i} \int_0^1 q^i (1-q)^{k-i} h(q)\,dq \quad \text{as } L \to \infty. \qquad (13)$$

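If we take the trajectory $X$ to be fixed, we can rewrite equation (13) as 

$$\nu_{BP} \binom{k}{i} \int_0^1 q^i (1-q)^{k-i} h(q)\,dq
  = \nu_{BP} \binom{k}{i} \int_0^\infty q(r_{BP}\ell, X)^i \left(1 - q(r_{BP}\ell, X)\right)^{k-i} d\ell
  = \frac{\nu_{BP}}{r_{BP}} \binom{k}{i} \int_0^\infty q(r, X)^i \left(1 - q(r, X)\right)^{k-i} dr, \qquad (14)$$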



which decouples the dependency of the rate of sweeps on the recombination rate $r_{BP}$ from 
the trajectory $X$. If $X$ is random, then we need to average over possible trajectories, and so 
we define 

$$J_{k,i} = \binom{k}{i}\, E_X\!\left[ \int_0^\infty q(r, X)^i \left(1 - q(r, X)\right)^{k-i} dr \right], \qquad (15)$$

where $E_X[\cdot]$ denotes the average over possible trajectories. We will assume that this integral 
is finite for $2 \le i \le k$; for further discussion of these points see Appendix A.1. Importantly, 
under our assumption that sweeps do not interfere with each other, $J_{k,i}$ does not depend 
on the recombination rate $r_{BP}$ or the rate of sweeps $\nu_{BP}$, but only on the dynamics of the 
selective sweeps $X$. 

Allowing coalescent events due to drift, $k$ lineages coalesce down to $k - i + 1$ at rate 

$$\lambda_{k,i} = \frac{\delta_{i,2}}{2N}\binom{k}{2} + \frac{\nu_{BP}}{r_{BP}} J_{k,i} \quad \text{for } 2 \le i \le k, \qquad (16)$$

where $\delta_{i,2} = 1$ if $i = 2$ and is 0 otherwise. As equation (16) follows from the simple 
change of variable in equation (14) it will hold under any homogeneous sweep model where 
sweeps instantaneously (relative to a time scale of $2N$) force lineages to coalesce, with 
$J_{k,i}$ replaced by some constant that does not depend on $r_{BP}$ or $\nu_{BP}$. This result greatly 
generalizes that of Kaplan et al. (1989), who described a similar coalescent process for a 
full sweep model. 

We can see from equation (16) that $2N\nu_{BP}/r_{BP}$ is the relevant compound parameter 
that in a general sweep model determines the rate of sweeps relative to neutral coalescent 
events. In small samples, sweep-induced coalescent events will dominate those due to drift 
if the population-scaled rate of sweeps per unit of the genetic map is much greater than one, 
provided that not all the $J_{k,i}$ are too small. We revisit this strong sweep limit in Section 2.7. 



The coalescent process with homogeneous partial sweeps. 

We now return to the setting of section 2.3.1 in which a simple trajectory rises quickly 
to frequency $x$, under which assumptions $q(r, X) \approx x e^{-r t_x}$. We suppose that 
the frequency $x$ at which each sweep slows is chosen independently with probability density 
$g(x)$. It also seems reasonable to assume furthermore that $t_x$, the time it takes to reach 
frequency $x$, does not depend on $x$; we will denote this time by $t$. This is approximately 
true for many models of directional selection, since selected alleles move quickly through 
intermediate frequencies. In this case, the rate at which $k$ lineages coalesce to $k - i + 1$ is 

$$\lambda_{k,i} = \frac{\delta_{i,2}}{2N}\binom{k}{2} + \binom{k}{i} \frac{\nu_{BP}}{t\, r_{BP}} \int_0^1 \left( \int_0^x u^{i-1} (1-u)^{k-i}\,du \right) g(x)\,dx, \qquad (17)$$

suggesting that the important quantity, which acts as a coalescent time scaling, is $2N\nu_{BP}/(t\, r_{BP})$, 
with the distribution on $x$ acting to control how many lineages are forced to coalesce with 
each sweep. If we determine $t$ by a simple model of additive selection with selection coefficient 
$s$, the key parameter becomes $2N\nu_{BP}s/(\log(Ns)\, r_{BP})$. 
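
As a numerical check on the change of variables behind this rate, the sketch below evaluates $J_{k,i}$ for a single fixed $x$ in two ways: by integrating $q(r)^i (1-q(r))^{k-i}$ over $r$ with $q(r) = x e^{-rt}$, and by integrating $u^{i-1}(1-u)^{k-i}$ over $u$ from 0 to $x$ and dividing by $t$. The midpoint-rule integrator, the truncation point of the integral over $r$, and the parameter values are illustrative assumptions rather than anything taken from the analysis above.

```python
import math

def midpoint(func, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (j + 0.5) * h) for j in range(steps))

def J_direct(k, i, x, t, r_max_factor=40.0):
    """C(k,i) * integral over r of q^i (1-q)^(k-i), with q = x*exp(-r*t).

    The infinite upper limit is truncated at r_max_factor / t, where the
    integrand is negligible.
    """
    def integrand(r):
        q = x * math.exp(-r * t)
        return q**i * (1 - q) ** (k - i)
    return math.comb(k, i) * midpoint(integrand, 0.0, r_max_factor / t)

def J_substituted(k, i, x, t):
    """Same quantity after substituting u = x*exp(-r*t), so dr = -du / (t*u)."""
    def integrand(u):
        return u ** (i - 1) * (1 - u) ** (k - i)
    return math.comb(k, i) * midpoint(integrand, 0.0, x) / t

x, t = 0.5, 100.0                      # hypothetical values
print(J_direct(2, 2, x, t), J_substituted(2, 2, x, t), x**2 / (2 * t))
```

For a pair of lineages both routes reduce to $J_{2,2} = x^2/(2t)$, which makes the dependence of the pairwise rate on the sweep frequency explicit.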

This compound parameter, $2N\nu_{BP}s/(\log(Ns)\, r_{BP})$, is also the key parameter in full 
sweep models (Kaplan et al. 1989; Stephan et al. 1992). However, since full sweeps 
require $x = 1$, if diversity is strongly reduced then numerous lineages must merge at each 
sweep, which in turn leads to a strong skew towards rare alleles in the frequency spectrum. 
We will see that this relationship between the reduction in diversity and the skew in the 
frequency spectrum is substantially weakened under a partial sweep model when we allow 
$x \ll 1$. 



2.6 Summaries of neutral genetic diversity. 
2.6.1 Level of neutral diversity. 

A key quantity of interest is the level of neutral nucleotide diversity, $\pi$, the number of 
differences between a pair of randomly sampled alleles at a neutral locus. Under an infinite sites 
model of mutation, which we will use here, the expectation of $\pi$, averaging across sites, is 
equal to the expected coalescent time of a pair of lineages multiplied by twice the mutation 
rate. If the mutation rate per generation at our neutral locus is $\mu$, then in the absence of sweeps 
the level of diversity is $E[\pi] = \theta$, where $\theta = 4N\mu$ is the population-size scaled mutation rate, 
and the expectation is the average across sites. Note that $\theta$ is the level of diversity under 
the usual neutral model. 

Under our model featuring both sweeps and drift, 

$$E[\pi] = \frac{\theta}{1 + 2N\nu I_{2,2}}, \qquad (18)$$

so a key parameter is the population-scaled rate of sweeps $2N\nu$. 

To examine the applicability of our approximations we again performed coalescent 
simulations with mssel for a selected locus at a fixed location experiencing recurrent sweeps. 
In this case, where selected alleles recurrently sweep into the population at a fixed genetic 
distance $r$, following our simple partial sweep trajectory again as characterized by $x$ and 
$t_x$, the nucleotide diversity is given by 

$$E[\pi] = \frac{\theta}{1 + 2N\nu\, x^2 e^{-2 r t_x}}. \qquad (19)$$

We used two types of recurrent trajectory: the recurrent 'step' and the recurrent 'top-hat', 
as described earlier. For the recurrent top-hat trajectory, we simulated an exponential 
waiting time with mean $1/\nu$ between the end of one 'top-hat' and the start of the next (and 
similarly for the 'step' case). In Figure 3 we show diversity levels moving away from the 
locus undergoing these two types of recurrent sweeps, as well as the analytical approximation 
given by equation (19). Recall that in both types of trajectories the derived allele pauses at 
frequency $x$ for time $T$, and therefore we expect that the fate of the allele will affect diversity 
at recombination distances smaller than $1/T$. For distances larger than $1/T$, equation (19) 
shows good agreement with our simulations, regardless of whether the recurrent sweeps go 
to loss or fixation. The approximation does not perfectly match our simulations, presumably 
because $e^{-r t_x}$ is an imperfect approximation to the probability of recombination during the 
sweep. Nevertheless, diversity levels generated by the two types of recurrent trajectory agree 
away from the selected site, which importantly confirms that only the initial rapid stage of 
the trajectory affects the coalescent process at partially linked sites. 
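
The analytical approximation for a fixed-distance recurrent sweep is simple enough to evaluate directly. The sketch below assumes the form that follows from equation (18) when every sweep has the same $q = x e^{-r t_x}$, so that $I_{2,2} = q^2$; the function names and the example values are hypothetical.

```python
import math

def diversity_reduction_fixed_site(two_N_nu, x, r, t_x):
    """E[pi]/theta for sweeps at a fixed distance: 1 / (1 + 2N*nu * q^2), q = x*exp(-r*t_x)."""
    q = x * math.exp(-r * t_x)
    return 1.0 / (1.0 + two_N_nu * q * q)

def rate_for_target_reduction(target, x, r, t_x):
    """Invert the expression above: the value of 2N*nu that gives E[pi]/theta = target."""
    q = x * math.exp(-r * t_x)
    return (1.0 / target - 1.0) / (q * q)

# Hypothetical example: r * t_x = 0.6 and sweeps that rise quickly to x = 0.5.
x, r, t_x = 0.5, 0.006, 100.0
two_N_nu = rate_for_target_reduction(0.5, x, r, t_x)
print(two_N_nu, diversity_reduction_fixed_site(two_N_nu, x, r, t_x))
```

The inverse function is convenient when comparing sweep models at a matched reduction in diversity, since the required population-scaled sweep rate grows as $1/q^2$ when sweeps reach lower frequencies.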



The level of diversity under homogeneous sweeps. Under the model in which sweeps 
occur homogeneously along an infinite sequence, with coalescent rates given by equation (16), 
the level of nucleotide diversity is given by 

$$E[\pi] = \frac{\theta}{1 + 2N\nu_{BP} J_{2,2}/r_{BP}}. \qquad (20)$$

These results generalize previous results by Kaplan et al. (1989) and Stephan et al. (1992), 
who found a relationship of the form (20) for a model of homogeneous recurrent full sweeps. 
In fact, since equation (20) follows only from the assumption that the rate and characteristics 
of sweeps are independent of their location along the genome (see equation (14)), this 
relationship between diversity, the density of selective targets, and recombination rate will 
hold for a wide variety of homogeneous recurrent sweep models including the homogeneous 
full sweep model. 

Figure 3: Reduction in diversity ($\pi/\theta$) as a function of recombination distance from 
a site experiencing recurrent sweeps. The three panels are for different values of the frequency 
$x$ that each sweep reached rapidly. The solid line is for recurrent top-hat trajectories and the 
broken line for recurrent step trajectories. The time that the trajectory pauses is $T/2N = 0.01$ 
and $t_x/2N = 0.003$ in both cases. The three colors correspond to three different population- 
scaled rates of sweeps: $2N\nu = 2$, 4 and 8. The vertical grey line marks recombination 
distance $r = 1/T$ from the selected locus, above which the dynamics subsequent to reaching $x$ 
should make little difference. The solid black lines give the prediction of equation (19). 

2.6.2 Frequency Spectrum. 

We now study the effects of recurrent partial sweeps on other properties of neutral diversity 
at a locus besides pairwise nucleotide diversity, and compare our calculations to simulation. 

Two commonly studied properties of a sample of neutral diversity at a locus are the 
expected number of segregating sites in a sample of size $n$, and the expected number of 
singletons in a sample of size $n$. Under the infinite-sites assumption, these are respectively 
equal to the mutation rate multiplied by the expected total length of the genealogical tree 
of the sample (which we denote $T_{tot}$) and the mutation rate multiplied by the expected 
total length of the terminal branches ($T_1$). We provide recursions that allow easy calculation 
of both $E[T_{tot}]$ and $E[T_1]$ in Appendix A.2. 

We also look more generally at the frequency spectrum of segregating alleles, which is, 
in a sample of $n$ individuals, the proportion of segregating sites at which $k$ derived alleles 
are found, for each $1 \le k < n$. Let $F_{n,k}$ denote the expected proportion of segregating 
sites in a sample of size $n$ at which exactly $k$ samples carry the derived allele under an 
infinite sites model of mutation. $F_{n,k}$ is equal to the expected time in the coalescent tree 
spent on branches that subtend exactly $k$ tips (those on which mutation would lead to a 
site segregating at $k$ out of the $n$ samples), divided by $E[T_{tot}]$. Under neutrality (Kingman's 
coalescent), this quantity is $F^0_{n,k} = (1/k) / \sum_{j=1}^{n-1} (1/j)$. It is not so easy to find an explicit 
general expression under the coalescent model with sweeps that we study, but for the case 
$k = 1$ we have described in Appendix A.2 how to compute $E[T_1]/E[T_{tot}]$, and the general 
case can be found from simulation of the coalescent process. 
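
The neutral baseline against which spectra are compared can be tabulated exactly. The short sketch below computes the expected proportion of segregating sites carrying $k$ derived copies under Kingman's coalescent, $(1/k)/\sum_{j=1}^{n-1}(1/j)$, using exact rational arithmetic; the function name is hypothetical.

```python
from fractions import Fraction

def neutral_spectrum(n):
    """Expected proportion of segregating sites with k derived copies (k = 1..n-1)
    in a sample of size n under the standard neutral coalescent."""
    harmonic = sum(Fraction(1, j) for j in range(1, n))
    return {k: Fraction(1, k) / harmonic for k in range(1, n)}

spectrum = neutral_spectrum(20)
singleton_fraction = spectrum[1]       # neutral expected proportion of singletons
print(float(singleton_fraction))
```

Dividing a simulated spectrum under sweeps by these values gives the relative spectrum, with values above one at $k = 1$ indicating an excess of singletons.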

Figure 4A shows the ratio $F_{n,k}/F^0_{n,k}$ estimated by direct simulation of our coalescent 
process. The rates are given by equation (9), with $q$ fixed to $q = x e^{-r t_x}$, and $r t_x = 0.6$ (and 
various $x$). To make the simulations comparable, the population scaled rate of sweeps $2N\nu$ 
was adjusted such that $\pi/\theta = 1/2$ in each, i.e. to obtain a 50% reduction in diversity due 
to sweeps. We see that for partial sweeps at a fixed site, across a range of $x$, the frequency 
spectrum is skewed towards rare alleles and away from intermediate frequency alleles. 

To test the degree to which our coalescent matches the full model, in Figure 4B we 
compare the mean proportion of singleton sites under our coalescent model to that found 
from simulation with mssel. We simulated a recurrent top-hat trajectory of the frequency 
at a selected locus as before, and used this trajectory with mssel to simulate the neutral 
coalescent at a non-recombining locus a distance $r$ away from this selected locus. We used 
the three values $x = 0.9$, 0.5, and 0.2 for the intermediate frequency the allele reached, 
and in each case varied the rate of sweeps, $\nu$. Each combination of $\nu$ and $x$ gives a point 
in Figure 4B, plotted at its mean reduction in diversity ($\pi/\theta$) and the mean number of 
singletons divided by the mean number of segregating sites. These are compared to the 
analytical values of $E[T_1]/E[T_{tot}]$ computed using the recursions of Appendix A.2, with coalescent 
rates given by equation (9) using a constant $q = x e^{-r t_x}$, and equation (19) to find the reduction $\pi/\theta$. 
There is good agreement between the simulations and the analytical results, showing that 
our simplified process approximates the properties of the full coalescent process at a single 
site reasonably well. 

Figure 4: Properties of the frequency spectrum with sweeps occurring at a fixed 
genetic distance. Coalescent rates are given by equation (9), with $q$ fixed to $q = x e^{-r t_x}$ and 
$r t_x = 0.6$, across a range of $x$. (A) The percentage of segregating sites found at frequency 
$1 \le k < 20$, relative to the neutral expectation (i.e. $F_{20,k}/F^0_{20,k}$). In these simulations the rate 
of sweeps $N\nu$ has been fixed to result in a 50% reduction in diversity. The dotted grey line 
gives the neutral expectation. (B) The mean number of singletons divided by mean number 
of segregating sites, from mssel simulations with a sample size of 10 at a neutral site a 
distance $2Nr = 200$ from a selected site. The selected allele performs a recurrent top-hat 
trajectory (with $N = 10{,}000$ and $t_x/2N = 0.003$, giving $r t_x = 0.6$, and pausing $T/2N = 0.01$) 
to frequency $x = 0.2$, $x = 0.5$, or $x = 0.9$ across a range of $2N\nu$. Note the span of $\pi/\theta$ is 
smaller in the low $x$ simulations as the effect on diversity of a given $2N\nu$ is smaller. Solid 
lines show the analytical approximation for $E[T_1]/E[T_{tot}]$ of Appendix A.2. The dotted grey 
line gives the neutral value of the expected proportion of singletons, $1/\sum_{j=1}^{n-1} (1/j)$. 

Figure 4 studied the effect on the frequency spectrum of recurrent sweeps at a fixed 
distance from a neutral site; in Figure 5 we study the frequency spectrum under the coalescent 
process with sweeps occurring homogeneously along the genome. Figures 5A and 5B show the 
same quantities as Figure 4A, for simulations of the homogeneous partial sweep coalescent 
process with a fixed value of $x$, using rates given by equation (17), and $2N\nu_{BP}/(t\, r_{BP})$ chosen 
so that $\pi$ is 50% and 10% of its value under neutrality respectively. In Figure 5C, there is no 
genetic drift and only sweeps force coalescence, i.e. $N = \infty$, and so we do not need to specify 
$2N\nu_{BP}/(t\, r_{BP})$ as it acts only as a time scaling. In Figure 5D we show our analytic calculation of 
$E[T_1]/E[T_{tot}]$ as a function of the reduction in $\pi$ caused by selective sweeps. 

The skew in the frequency spectrum depends strongly on the frequency $x$ reached by the 
selected allele. Sweeps to low frequencies lead to a much smaller distortion for the same 
reduction in $\pi$. Therefore, the strong relationship between the reduction in $\pi$ and the skew 
in the frequency spectrum under a model of full sweeps is much weaker if the sweeps do not 
go to fixation. 

Intriguingly, sweeps that go to intermediate frequency can lead to a greater proportion of 
high frequency derived alleles than under a full sweep model. While a single, recent full sweep 
leads to high frequency derived alleles through hitchhiking (Fay and Wu 2000), under a 
recurrent full sweep model these alleles are then fixed in the population by subsequent sweeps 
and drift (Kim 2006), and therefore removed from the frequency spectrum. Further work 
would be needed to understand the intuition for the excess of high frequency derived alleles 
under a recurrent partial sweep model. 



Summaries of the frequency spectrum. In Figures 4 and 5 we saw that regardless 
of whether sweeps occur at a fixed distance from our neutral site or homogeneously along 
the sequence, as we increase the rate of sweeps the frequency spectrum becomes further 
skewed towards rare derived alleles at the expense of intermediate frequency alleles. Here 
we provide evidence that this will hold for any set of parameter values. Tajima's D and 
Fu and Li's D (Tajima 1989; Fu and Li 1993) are two common ways of detecting deviations 
away from the frequency spectrum expected under a neutral model with a constant 
population size. Negative values of Tajima's D can be thought of as indicating a deficit of 
intermediate frequency alleles, and negative values of Fu and Li's D indicate an excess of singleton alleles. 
Durrett and Schweinsberg (2005) proved that in large samples, both of these summary 
statistics are negative under a multiple mergers coalescent model of full sweeps, as long as 
$\lambda_k$, the total coalescent rate when there are $k$ lineages, satisfies 



E 

k=2 



A;2 



< oo. 



(21) 



See equation (4.5) in Durrett and Schweinsberg (2005). Informally, this condition requires 
that the total coalescent rate is not too much higher than the neutral coalescent rate 
when there are a large number of lineages. Their methods were not specific to their situation 
but hold for all multiple merger coalescent models satisfying equation (21). Above, 
we argued that a generalized sweep model can be approximated by a multiple merger 
coalescent, and therefore, it seems that reasonable generalized sweep models will, at least for 
large samples, have a frequency spectrum that is skewed towards singletons at the expense 
of intermediate frequency alleles (a notable exception is the 'low frequency' limit we discuss 
below). 

Figure 5: Properties of the frequency spectrum under a spatially homogeneous 
model of sweeps using the coalescent process with rates given by equation (17). 
Simulations were performed for a sample size of 20. For a particular $x$ we adjusted the value 
of $N\nu_{BP}/(t\, r_{BP})$ to achieve the specified reduction in $\pi$. (A) and (B) The percentage of 
segregating sites found at frequency $1 \le k < 20$, relative to the neutral expectation for 
sweeps. In each panel the reduction in diversity, $\pi/\theta$, is fixed. (C) The same quantities as in 
A and B, but for the case where there is no genetic drift, and sweeps are the only stochastic 
force affecting allele frequencies. (D) The fraction of segregating sites that are singletons, 
for different $x$, as a function of $\pi/\theta$, calculated using recursions for $E[T_1]/E[T_{tot}]$ (Appendix A.2). 



2.7 Limiting processes 

Before we move to discuss the implications of these results for data analysis there are two 
limiting processes that merit our attention. The first is when the rate of sweeps is sufficiently 
high to dominate genetic drift as a source of stochasticity. The second limit results when 
sweeps very rarely achieve high frequency in the population, in which case the resulting 
coalescent model is identical to the standard "neutral" coalescent, despite the fact that 
much of the stochasticity may be driven by sweeps. 



The rapid sweep limit 

A surprising conclusion from the homogeneous model and equation (16) is that if all 
coalescences come from "selective" events, then the frequency spectrum does not depend on the 
density of selective targets or on the recombination rate (although the number of segregating 
sites certainly does). This effect can be seen in Figure 5D as the fraction of singleton sites 
plateaus when the reduction in $\pi$ is large, i.e. when the population scaled rate of sweeps 
per unit of recombination is high, $\nu_{BP}/r_{BP} \gg 1/2N$. The easiest way to see this is to take 
$N \to \infty$ while keeping the rate of sweeps and their trajectory dynamics fixed, so that in a 
sample of fixed size the coalescence rate from equation (16) converges to $\lambda_{k,i} \to (\nu_{BP}/r_{BP}) J_{k,i}$, 
where $J_{k,i}$ does not depend on $\nu_{BP}$, $r_{BP}$, or $N$. In this limit, $\nu_{BP}$ and $r_{BP}$ only affect the 
process by a time scaling, do not affect the transition probabilities of equation (11), and so 
do not affect the frequency spectrum. Diversity in this limit behaves as 

$$E[\pi] = \frac{2\mu\, r_{BP}}{\nu_{BP} J_{2,2}} \qquad (22)$$

(assuming, as usual, that $\mu$ is sufficiently small), i.e. nucleotide diversity increases linearly 
with the recombination rate, if neither $\nu_{BP}$ nor $J_{2,2}$ varies across recombination environments. 
Similar limits can also be derived by letting $N \to \infty$ under the more general coalescent 
process with rates given by equation (7). 
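
The claim that the sweep density and recombination rate act only as a time scaling in this limit can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the simple partial sweep with a single fixed $x$ and rise time $t$, for which $J_{k,i} = \binom{k}{i}\, t^{-1} \int_0^x u^{i-1}(1-u)^{k-i}\,du$, and computes the merger probabilities with and without the drift term. Function names and parameter values are hypothetical.

```python
import math

def J_partial(k, i, x, t, steps=20_000):
    """J_{k,i} for a fixed x and rise time t, by the midpoint rule over u in [0, x]."""
    h = x / steps
    integral = h * sum(
        ((j + 0.5) * h) ** (i - 1) * (1 - (j + 0.5) * h) ** (k - i)
        for j in range(steps)
    )
    return math.comb(k, i) * integral / t

def merge_probs(k, x, t, nu_bp, r_bp, N=None):
    """Probability that a coalescent event among k lineages merges i of them.

    N=None drops the drift term, i.e. the N -> infinity limit.
    """
    rates = {}
    for i in range(2, k + 1):
        rate = nu_bp / r_bp * J_partial(k, i, x, t)
        if i == 2 and N is not None:
            rate += math.comb(k, 2) / (2 * N)   # pairwise coalescence due to drift
        rates[i] = rate
    total = sum(rates.values())
    return {i: rate / total for i, rate in rates.items()}

# Without drift, changing nu_bp / r_bp leaves the merger probabilities unchanged.
low = merge_probs(6, 0.5, 100.0, nu_bp=1e-9, r_bp=1e-8)
high = merge_probs(6, 0.5, 100.0, nu_bp=5e-9, r_bp=1e-8)
print(low[2], high[2])

# With drift, the same change shifts weight between pairwise and multiple mergers.
low_drift = merge_probs(6, 0.5, 100.0, nu_bp=1e-9, r_bp=1e-8, N=10_000)
high_drift = merge_probs(6, 0.5, 100.0, nu_bp=5e-9, r_bp=1e-8, N=10_000)
print(low_drift[2], high_drift[2])
```

Because the genealogy's shape depends only on these merger probabilities, identical probabilities imply an identical expected frequency spectrum, while the absolute time scale (and hence the number of segregating sites) still changes.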

For this limit to be a reasonable approximation for a sample of size $n$ in a population of 
size $N$, we need the rate of neutral coalescences to be much smaller than the rate of selective 
coalescences, i.e. $\binom{n}{2} \ll (2N\nu_{BP}/r_{BP}) \sum_{i=2}^{n} J_{n,i}$. In sufficiently large samples, $\binom{n}{2}$ will be large 
enough that the coalescence rate due to genetic drift will be appreciable, at least until the 
number of lineages surviving back in time declines. From a technical standpoint, this is 
related to the question of whether the coalescent process "comes down from infinity" (for a 
review see Berestycki 2009). 

The low frequency limit 

As noted in our discussion of Figure 5, the frequency spectrum may be close to neutral in 
appearance even with large reductions in $\pi$ if selected alleles sweep only to low frequency. 
In fact, by taking a limit (satisfying certain conditions) in which sweeps occur frequently, 
but each sweep has a small probability of causing coalescence, we can recover Kingman's 
coalescent. 

We illustrate this limit by taking $\nu \to \infty$ and allowing $f(q)$ to depend on $\nu$ in such a way 
that as $\nu \to \infty$, $I_{k,\ell}/I_{k,2} \to 0$ for all $3 \le \ell \le k$, and that $\nu I_{k,2} \to \binom{k}{2}\gamma$ for some $0 < \gamma < \infty$. 
As shown in Appendix IA.31 a sufficient condition for this is that lim,^^oo ^ Jq f{(l)dq is 
finite. In this limiting case, the rate of coalescence is 

$$\lambda_{k,2} = \binom{k}{2}\left(\frac{1}{2N} + \gamma\right), \qquad (23)$$

so the limiting model behaves exactly as the standard neutral coalescent but with an effective 
population size of 

$$N_e = \frac{N}{2N\gamma + 1}. \qquad (24)$$

Note that the limiting coalescent process does not satisfy condition (21) of Durrett and 
Schweinsberg (2005), and that Tajima's D and Fu and Li's D will have mean equal to zero at all sample 
sizes, as is natural since the limiting process is just the neutral (Kingman's) coalescent. 
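
A small numerical illustration of this limit, under the simplifying assumption that every sweep has the same small catch probability $q$ and that the sweep rate is scaled as $\nu = \gamma/q^2$ so that $\nu q^2$ stays fixed: as $q$ shrinks, the per-pair coalescence rate due to sweeps approaches $\gamma$ and the total rate of mergers involving three or more lineages becomes negligible relative to pairwise mergers. The function names and parameter values below are hypothetical.

```python
import math

def rates_fixed_q(k, q, nu):
    """Sweep-induced rates nu * I_{k,i}, i = 2..k, when every sweep has the same q."""
    return {i: nu * math.comb(k, i) * q**i * (1 - q) ** (k - i) for i in range(2, k + 1)}

def effective_size(N, gamma):
    """Effective population size of the limiting Kingman coalescent: N / (2*N*gamma + 1)."""
    return N / (2 * N * gamma + 1)

gamma, k = 1e-4, 10                     # hypothetical values
summary = []
for q in (0.1, 0.01, 0.001):
    nu = gamma / q**2                   # keep nu * q^2 = gamma as q shrinks
    rates = rates_fixed_q(k, q, nu)
    per_pair = rates[2] / math.comb(k, 2)
    multiple_vs_pair = sum(rates[i] for i in range(3, k + 1)) / rates[2]
    summary.append((q, per_pair, multiple_vs_pair))

for row in summary:
    print(row)
print(effective_size(10_000, gamma))
```

In this parameterization the ratio of multiple to pairwise mergers falls roughly in proportion to $q$, which is why very frequent sweeps to low frequency rescale the effective population size without distorting the shape of the genealogy.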

In the case of our simple partial sweep coalescent this limit would occur if the frequency 
X reached by sweeps is taken to zero as the rate of sweeps grows at least as The simple 

homogeneous full sweep coalescent process obviously cannot be taken to this limit as there 
is a prescribed set of $J_{k,i}$, which feature a non-trivial amount of coalescence involving more 
than pairs of lineages. 

Interference. In both limits discussed above the population-scaled rate of sweeps has to 
be very high. In the first limit the rate of sweeps has to be high enough to dominate the 
rate of neutral coalescence; in the second limit the rate of sweeps has to be high enough 
to compensate for the fact that any one sweep is very unlikely to cause coalescence. The 
requirement of a high rate of sweeps implies that interference between the sweeps may occur, 
thus violating our assumption that the sweeps are independent. Investigations of the effect of 
such interference on the signal of hitchhiking have shown that interference reduces the impact 
of any one sweep on patterns of polymorphism (Kim and Stephan 2003; Chevin et al. 
2008), although to interfere, the sweeps must begin at very similar times at loci separated 
by a low recombination rate. This suggests that a very high rate of sweeps is needed indeed 
before interference will have an appreciable impact on the hitchhiking effect, as would occur 
in the homogeneous sweep model if $\nu_{BP}/r_{BP}$ is very large. The limits we describe above only 
require that the population size-scaled rate of sweeps ($N\nu$ or $N\nu_{BP}$) be high, and therefore 
it is possible to keep the per generation rate of sweeps sufficiently low as to avoid the effect of 
interference. Further work is needed to investigate coalescent models under such high rates 
of sweeps, and could be useful in understanding genealogical processes in organisms with low 
or no recombination that also experience strong selection pressures. 



3 Discussion 



The prevailing view of adaptation in a population genetics setting is based on a lone selected 
allele racing from its introduction into the population to fixation, carrying with it a chunk 
of the chromosome on which it arose. This cartoon has been a very useful prop for develop- 
ing tests to identify genes underlying recent adaptations, and for interpreting genome-wide 
patterns of polymorphism. However, it seems likely that such full sweeps constitute only a 
small proportion of the selected loci whose frequency changes in response to adaptation (see 
Pritchard et al. 2010, for a recent discussion). If we are to develop a better understanding 
of the full impact of linked selection on patterns of diversity we need to develop a richer and 
more flexible set of models. 

The work in this paper was motivated by models in which the external environment 
or the genetic background vary on a fast enough time scale that new alleles rarely reach 
fixation before selective pressures change, either slowing their advance or reversing their 
trajectory. We laid out an approximation to the coalescent process under such a model, and 
showed that, while the initial rapid stage of the trajectory will strongly impact the coalescent 
process, subsequent slower dynamics of the selected alleles have a much smaller effect. We 
then extended this idea to a recurrent sweep model, approximating the dynamics by a 
multiple-merger coalescent. While some of our results are fairly general, to provide a more 
intuitive sense we have often employed simple allele frequency trajectories and made other 
approximations. Nonetheless, we expect more realistic models to give rise to qualitatively 
equivalent results. 

Each sweep we consider consists of a single allele at a locus rising on a single haplotype 
from very low frequency into the population. This contrasts with many other soft sweep 
models, under which a sweep starts on multiple haplotypes, either because multiple different 
alleles initially segregated at the locus (Hermisson and Pennings 2005); or as a result of 
multiple mutations occurring after selection pressures switched (Pennings and Hermisson 
2006a,b; Ralph and Coop 2010); or because the adaptive allele was previously neutral 
and present on multiple haplotypes (Innan and Kim 2004; Przeworski et al. 2005). It 
is likely that recurrent models of such soft sweeps could be approximated through coalescent 
models with simultaneous multiple collisions (Schweinsberg 2000), to model the simultaneous 
rise of multiple haplotypes. This seems like a fruitful area of future work as it would 
substantially extend our understanding of the effects of a broad family of recurrent sweep 
models on genomic patterns of diversity. 

We have also ignored the effect of background selection. To a first approximation, the 
effect of background selection can be modeled as an increase in coalescence rate, which would 
be a minor modification to equations (9) and (16). This would alter the predicted relationship 
between diversity and recombination (Innan and Stephan 2003) given by equation (20), 
and would offer a simple way to model the genealogical effects of both general models of 
hitchhiking and background selection. 



The interpretation of population genomic patterns 

Models in which selective sweeps do not always sweep to fixation have a much wider spec- 
trum of predictions than the recurrent full sweep model. Three broad correlations that 
have been used to argue for the prevalence of linked selection, and used to potentially dis- 
criminate between models invoking background selection or full sweeps are: 1) correlations 
between neutral diversity and the recombination rate; 2) correlations between the frequency 
spectrum and the rate of recombination; and 3) correlations between putatively adaptive 
divergence and neutral diversity. We now describe some of the implications of our results 
for understanding these patterns in population genomic data. 



Correlation between recombination and diversity. One of the earliest and most compelling
pieces of evidence for the role of linked selection in the fate of neutral alleles is a
positive correlation between recombination and levels of diversity at putatively neutral sites
(factoring out substitution rates as a proxy for differences in mutation rate). This pattern
is consistent with both full sweeps and background selection, as both predict positive, albeit
differently shaped, relationships (Innan and Stephan, 2003). The shape of the
diversity-recombination curve under a homogeneous rate of partial sweeps is identical to that
under the full sweep model, and more generally to that under a broad class of homogeneous
sweep models. In fact, the relationship under a homogeneous model depends only on
$2N\nu_{BP}J_{2,2}$, as seen in equation (20).

To illustrate this point, in Table 1 we present estimates of $2N\nu_{BP}J_{2,2}$ for humans and
Drosophila melanogaster, assuming a model with drift and a homogeneous rate of selective
sweeps across the genome, obtained from equation (20) and data from Hellmann et al. (2008) and
Shapiro et al. (2007). Along with these estimates, Table 1 also shows the implied rate of
sweeps per generation per base pair, $\nu_{BP}$, under the simple partial sweep model, for a
variety of values of $x$. These rates are surely overestimates and are intended for illustrative
purposes only, as they ignore the effect of other forms of linked selection, e.g. background
selection.
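
The reason that only this compound parameter matters can be sketched as follows. This is an illustration under stated assumptions rather than a restatement of the authors' derivation: we assume a pair of lineages coalesces at rate $1/(2N)$ through drift and at rate $\nu_{BP}J_{2,2}/r_{BP}$ through sweeps, and we introduce a hypothetical per-site mutation rate $\mu$ with $\theta = 4N\mu$.

```latex
\begin{align*}
\mathbb{E}[T_2] &= \left( \frac{1}{2N} + \frac{\nu_{BP} J_{2,2}}{r_{BP}} \right)^{-1}, \\
\mathbb{E}[\pi] &= 2\mu\, \mathbb{E}[T_2]
  = \frac{\theta}{1 + 2N\nu_{BP}J_{2,2}/r_{BP}}
  = \frac{\theta\, r_{BP}}{r_{BP} + 2N\nu_{BP}J_{2,2}} .
\end{align*}
```

Under these assumptions, diversity as a function of $r_{BP}$ is governed by $\theta$ and the single quantity $2N\nu_{BP}J_{2,2}$, so any two homogeneous sweep models with the same value of that quantity give the same curve.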

The strength of the relationship between diversity levels and recombination varies
dramatically between the two species, as indicated by the very different estimates of
$2N\nu_{BP}J_{2,2}$ (note that the estimates of $\nu_{BP}$ are similar due to the hundredfold
difference in $N$). In Drosophila the positive relationship between recombination and diversity
is strong (e.g. Aguade et al., 1989; Begun and Aquadro, 1992; Berry et al., 1991; Shapiro et
al., 2007; Begun et al., 2007), but in humans the relationship seems to be weaker and is
complicated by other confounding factors (Payseur and Nachman, 2002; Hellmann et al., 2003,
2005, 2008; Cai et al., 2009). However, we should be cautious in the biological interpretation
of this difference, as in humans diversity is usually estimated in large windows
(much of which will be noncoding and far from genes), while in Drosophila neutral diversity
levels are usually estimated from synonymous sites in individual genes. What is needed is
a comparative analysis that studies these patterns at the same genomic scale and accounts
for the profound differences in the density of functional targets among species.

                                                  $\nu_{BP}$ across a range of $x$
          $\theta$   $2N\nu_{BP}J_{2,2}$    $x = 1.0$              $x = 0.5$              $x = 0.2$              $x = 0.05$

Human     0.0017     $6 \times 10^{-11}$    $3.0 \times 10^{-12}$  $1.2 \times 10^{-11}$  $7.5 \times 10^{-11}$  $1.2 \times 10^{-9}$
D. mel    0.025      $7.3 \times 10^{-9}$   $3.6 \times 10^{-12}$  $1.5 \times 10^{-11}$  $9.1 \times 10^{-11}$  $1.5 \times 10^{-9}$

Table 1: Estimates of sweep parameters from the relationship between diversity
and recombination. The estimate for humans was taken from Hellmann et al. (2008),
who fitted a curve of the form of equation (20). The estimate for Drosophila melanogaster
(D. mel) was obtained by fitting equation (20) to the synonymous polymorphism and
sex-averaged recombination rates of Shapiro et al. (2007) (kindly provided by Peter Andolfatto;
see Sella et al. (2009) for details) using non-linear least squares via the nls() function in
R. These estimates were converted into estimates of the rate of sweeps per generation per
base pair ($\nu_{BP}$, last four columns) under the simple partial sweep trajectory model where
$J_{2,2} = x^2/t_x$, assuming $t_x = 1{,}000$ generations (equivalent to a selection coefficient
of ~ 0.01) and that $N = 10^6$ in D. mel and $N = 10^4$ in humans.
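
The conversion described in the caption of Table 1 is simple arithmetic, and a minimal sketch of it is given below. The function and variable names are hypothetical, and two ingredients are assumptions on our part rather than values stated legibly in the text: the form $J_{2,2} = x^2/t$ with $t = 1{,}000$ generations, and population sizes of $10^4$ (humans) and $10^6$ (D. mel). These are the choices under which the sketch agrees with the tabulated rates.

```python
# Hypothetical helper names. The J_{2,2} = x**2 / t_sweep form and the
# population sizes below are assumptions, not values confirmed by the text.

def sweep_rate_per_bp(compound_estimate, pop_size, x, t_sweep=1000.0):
    """Return nu_BP given an estimate of 2 * N * nu_BP * J_{2,2}."""
    j22 = x ** 2 / t_sweep  # assumed J_{2,2} for a sweep reaching frequency x
    return compound_estimate / (2.0 * pop_size * j22)


# species -> (estimate of 2*N*nu_BP*J_{2,2}, assumed N)
ESTIMATES = {"Human": (6e-11, 1e4), "D. mel": (7.3e-9, 1e6)}
X_VALUES = (1.0, 0.5, 0.2, 0.05)


def table_rows():
    """Implied nu_BP for each species across X_VALUES."""
    return {
        species: [sweep_rate_per_bp(est, n, x) for x in X_VALUES]
        for species, (est, n) in ESTIMATES.items()
    }


if __name__ == "__main__":
    for species, rates in table_rows().items():
        print(species, ", ".join(f"{r:.2g}" for r in rates))
```

Because $J_{2,2}$ scales as $x^2$ under this assumption, the implied sweep rate grows as $1/x^2$ across the columns, which is why sweeps to frequency 0.05 require a rate 400 times higher than full sweeps to produce the same reduction in diversity.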



The fact that the diversity-recombination curve plateaus rapidly in humans is strong
evidence that linked selection does not affect the average neutral site in regions of high
recombination. Technically, this could also occur if the density of selective targets $\nu_{BP}$
decreases approximately linearly with recombination rate; however, this option does not
seem likely a priori.

Although in Drosophila melanogaster this curve is still concave, it does not appear to
flatten completely in high recombination regions (e.g. Sella et al., 2009), suggesting that
linked selection is an important source of stochasticity even in these regions. At face value the
concave nature of the curve suggests that both genetic drift and linked selection contribute
to stochasticity, as $N\nu_{BP} \gg r_{BP}$ would lead to an almost linear relationship across the
observed range of recombination rates (see equation (22)). However, a model with effectively
no genetic drift can produce a concave curve and fit the observed data if $\nu_{BP}J_{2,2}$ is not
constant across recombination environments, e.g. if sweeps occur at a moderately higher
rate or achieve higher frequency in high recombination regions. Neither of these two options
seems particularly unlikely, suggesting that we still have little unambiguous evidence favoring
genetic drift as an important source of stochasticity in Drosophila.
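
The contrast between a concave and an almost linear curve can be made concrete with a short numerical sketch. It assumes the curve has the form $\pi(r) = \theta r/(r + a)$ with $a = 2N\nu_{BP}J_{2,2}$, which is our reading of the relationship rather than a formula quoted from the text; the function names and the recombination rate used in the example are hypothetical.

```python
def expected_diversity(r_bp, theta, a):
    """Assumed curve pi(r) = theta * r / (r + a), with a = 2*N*nu_BP*J_{2,2}."""
    return theta * r_bp / (r_bp + a)


def doubling_ratio(r_bp, a):
    """pi(2r) / pi(r): 2 for a linear curve, near 1 once the curve has plateaued."""
    return expected_diversity(2.0 * r_bp, 1.0, a) / expected_diversity(r_bp, 1.0, a)


if __name__ == "__main__":
    r = 1e-8  # hypothetical recombination rate per base pair per generation
    for label, a in [("a >> r", 1e-5), ("a ~ r", 7.3e-9), ("a << r", 6e-11)]:
        print(label, round(doubling_ratio(r, a), 3))
```

When $a$ is much larger than $r$, doubling the recombination rate nearly doubles diversity, which is the almost linear regime expected when linked selection dominates drift. When $a$ is much smaller than $r$, diversity barely changes, which is the plateau.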



The frequency spectrum. The recurrent full sweep model predicts a strong positive
relationship between the reduction in neutral diversity and the skew towards rare alleles
(Braverman et al., 1995; Kim, 2006), a pattern not predicted under models of strong background
selection. This relationship has been used to test between full sweeps and background
selection models, although note that, as we discussed above, this relationship is not
expected if all coalescence comes from selective sweeps. Under our simple trajectory model, the
distortion of the frequency spectrum is primarily determined by the frequencies that sweeps
achieve. Therefore, although a lack of a strong skew in the frequency spectrum is consistent
with a low rate of full sweeps, it cannot be used to rule out a high rate of partial sweeps.
A lack of a genomic relationship between the frequency spectrum and recombination rate is
therefore not grounds for rejecting sweeps as a force in shaping genetic diversity in favor of
a model of background selection. Our results suggest that recurrent partial sweeps to low
frequency in regions of high recombination in D. melanogaster and in the low recombination
regions in humans may be a major source of stochasticity in allele frequencies.



Correlation between divergence and polymorphism. Attention has recently focused
on examining the correlation between neutral diversity and amino acid substitutions (or
other putatively functional changes) between recently separated species. If a reasonable
fraction of amino acid substitutions are driven by new mutations sweeping to fixation, then
levels of diversity should dip on average around amino-acid substitutions. This relationship
has been tested for by looking for a negative correlation between diversity levels and
amino-acid substitution rates (Macpherson et al., 2007; Andolfatto, 2007; Cai et al.,
2009; Haddrill et al., 2011) or for a dip in diversity levels around a large set of aggregated
amino acid substitutions (Hernandez et al., 2011; Sattath et al., 2011). If the density
of functional sites is properly controlled for, these types of correlations between amino-acid
substitutions and neutral diversity are not expected under a (simple) model of background
selection. Such correlations have been detected in Drosophila (Macpherson et al., 2007;
Sattath et al., 2011) but in humans the dip in diversity around non-synonymous substitutions
seems to result from the dip in diversity levels around genes, an observation that
seems inconsistent with a high rate of strong full sweeps (Hernandez et al., 2011). Similarly,
it has been observed that the highest $F_{ST}$ signals between human populations are not
associated with strongly reduced haplotypic diversity (Coop et al., 2009).

The fact that selected alleles in the partial sweep coalescent model do not have to sweep
all the way to fixation partially decouples the rate of fixation of adaptive alleles from their
effects on patterns of diversity within populations. Therefore, the strength of the
relationship between substitution rates and diversity depends on the fate of alleles that
sweep into the population. For example, this relationship may be weak, and a poor
predictor of the total reduction in diversity, if the majority of adaptive alleles that initially
sweep into the population are eventually lost (e.g. as can be the case for major effect alleles
in polygenic models of adaptation, see Lande, 1983; Chevin and Hospital, 2008).



Concluding thoughts. In this article, we have concerned ourselves with patterns of
diversity at a single neutral site. However, partial sweeps also have a strong effect on linkage
disequilibrium and haplotype diversity, a signature that has been exploited in scans for
selection (e.g. Hudson et al., 1994; Sabeti et al., 2002; Voight et al., 2006). One simple
case that we can immediately describe is the low q limit (see the section on the low q limit).
In that limit, the coalescent is equivalent to the standard neutral model and as such the decay
of LD will be the same as in the standard neutral model with an $N_e$ given by equation (24).
A natural way to extend this exploration would be the genealogical framework developed by
McVean (2007) that has recently been extended to a multiple mergers coalescent by Eldon and
Wakeley (2008).

We will soon have polymorphism data across a broad range of taxa that will differ
dramatically in selection regimes, recombination rates, genome size, and population size,
allowing a much fuller picture of how these various factors interplay to shape genome-wide
levels of polymorphism. The results presented here, however, suggest that we will continue to
struggle to distinguish between modes of selection, as relaxing the assumptions of various
models can generate a broad range of overlapping predictions.

Despite that, our results suggest a promising way forward, since a broad range of sweep
models can be captured by simple parameterizations of multiple-merger coalescent
processes. Importantly, this would allow parameter inference under a general model of linked
selection, rather than focusing on a limited number of specific models. For example, we could
estimate the rate at which selection forces different numbers of lineages to coalesce
(parameterized by $\nu f(q)$) as a function of recombination rates and the density of selective
targets. As it is easy to simulate under the multiple-mergers coalescent model, it may be
readily incorporated into many of our existing genealogical inference frameworks. It is likely
that parameters of such models could be estimated very precisely from genome-wide data,
allowing us to concentrate on what these high level summaries of polymorphism tell us about
linked selection across genomic environments and species. Such inferences may be important if
we wish to move beyond documenting the presence of linked selection towards describing the
genealogical process in species where selection is a major source of stochasticity.

A Appendices

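A.1 $J_{k,i}$ for a generalized trajectory

Recall that in the main text we defined

$$J_{k,i} = \binom{k}{i}\, \mathbb{E}_X\!\left[ \int_0^\infty q(X,r)^i \bigl(1 - q(X,r)\bigr)^{k-i}\, dr \right], \qquad 2 \le i \le k, \tag{25}$$

so that the rate at which the coalescent process having k lineages coalesces down to i
lineages from "selective" events is $\nu_{BP} J_{k,i} / r_{BP}$. The quantity $q(X, r)$ is the
pathwise Laplace transform of the process X, defined in equation (3), and consequently

$$1 - q(X, r) = \int_0^\infty r e^{-rt} \bigl(1 - X(t)\bigr)\, dt. \tag{26}$$

It is useful to note that by changing the order of integration,

$$J_{k,i} = \binom{k}{i}\, k!\; \mathbb{E}_X\!\left[ \int_0^\infty \!\!\cdots\! \int_0^\infty \frac{\prod_{j=1}^{i} X(t_j) \prod_{j=i+1}^{k} \bigl(1 - X(t_j)\bigr)}{(t_1 + \cdots + t_k)^{k+1}}\, dt_1 \cdots dt_k \right] \tag{27}$$

for $2 \le i \le k$, as long as the integral is finite. In the case of a pair of lineages,
$k = i = 2$ and this simplifies to

$$J_{2,2} = 2\, \mathbb{E}_X\!\left[ \int_0^\infty \!\! \int_0^\infty \frac{X(t_1) X(t_2)}{(t_1 + t_2)^3}\, dt_1\, dt_2 \right]. \tag{28}$$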
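
The change in the order of integration can be sketched in two steps. The sketch assumes that the integral over $r$ runs over $(0, \infty)$ and that the integrals may be exchanged (for instance by Fubini's theorem when everything is finite); it uses $q(X,r) = \int_0^\infty r e^{-rt} X(t)\, dt$, which follows from the expression for $1 - q(X,r)$.

```latex
\begin{align*}
q(X,r)^i \bigl(1-q(X,r)\bigr)^{k-i}
  &= \int_0^\infty \!\!\cdots\! \int_0^\infty r^k e^{-r(t_1+\cdots+t_k)}
     \prod_{j=1}^{i} X(t_j) \prod_{j=i+1}^{k} \bigl(1-X(t_j)\bigr)\, dt_1 \cdots dt_k, \\
\int_0^\infty r^k e^{-rs}\, dr &= \frac{k!}{s^{k+1}}, \qquad s = t_1 + \cdots + t_k > 0 .
\end{align*}
```

Integrating the first line over $r$ and applying the second gives the kernel $k!/(t_1 + \cdots + t_k)^{k+1}$; for $k = i = 2$ this is $2/(t_1 + t_2)^3$, which is where the factor of 2 in the pairwise expression comes from.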

To briefly explore the conditions for J to be finite, we will suppose that X leaves zero
as a power of t, i.e. $X(t) \sim t^{\alpha}$ for some $\alpha > 0$, for small t. We see that
$J_{k,2}$ increases as $\alpha$ decreases, i.e. the rate of sweeps is larger the more rapidly X
leaves zero. In this case, $q(r) \sim C r^{-\alpha}$ for large r, where C is a constant. Then since

$$\lim_{L \to \infty} \binom{k}{2} \int_0^L q(r)^2 \bigl(1 - q(r)\bigr)^{k-2}\, dr \;\le\; \lim_{L \to \infty} \binom{k}{2} \int_0^L q(r)^2\, dr,$$

it can be seen that $J_{k,2}$ is infinite if $\alpha < 1/2$, in the limit of an infinite genome. More
generally, if X leaves zero more quickly than $\sqrt{t}$ (which may be biologically unrealistic),
then sweeps occurring arbitrarily far away along the genome will cause coalescences.



A.2 Recursions to find $E[T_{tot}]$ and $E[T_1]$

Two properties of interest are the expected total amount of time in the genealogy at a neutral
locus ($E[T_{tot}]$) and the expected total amount of time in terminal branches ($E[T_1]$).

We first derive the expected total time in the genealogy. Recall that if the coalescent
process has k lineages, then it waits an exponentially distributed amount of time with mean
$1/\lambda_k$, and then jumps to a smaller number of lineages chosen with probabilities according
to $p_{k,\ell}$, with $\lambda_k$ and $p_{k,\ell}$ given in equations (10) and (11). Therefore, if we
let $G_{n,k}$ be the probability that the coalescent process that starts from n lineages ever
visits the state with k lineages, then

$$E[T_{tot}] = \sum_{k=2}^{n} \frac{k}{\lambda_k}\, G_{n,k}. \tag{29}$$

By conditioning on the last state visited before dropping to k lineages, we can see that
$G_{n,k}$ satisfies the recursion

$$G_{n,k} = \sum_{\ell=k+1}^{n} G_{n,\ell}\, p_{\ell,k}, \qquad \text{for } k < n, \tag{30}$$

with $G_{n,n} = 1$. This recursion is of upper triangular form, so is easily solvable, which
together with (29) allows us to compute $E[T_{tot}]$.
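
The recursion for the visiting probabilities can be solved by filling in the states from n downwards. The sketch below is an illustration rather than the authors' implementation: the function names are hypothetical, and the total rate $\lambda_k$ and jump probabilities $p_{\ell,k}$ are passed in as callables because their formulas are given elsewhere in the paper. The standard (Kingman) coalescent, with $\lambda_k = \binom{k}{2}$ and only binary mergers, is used as a check.

```python
from math import comb


def expected_total_length(n, rate, jump_prob):
    """E[T_tot] = sum_{k=2}^n (k / rate(k)) * G[k], where G[n] = 1 and
    G[k] = sum_{l=k+1}^n G[l] * jump_prob(l, k) for k < n."""
    visit = {n: 1.0}
    for k in range(n - 1, 1, -1):  # fill in k = n-1, ..., 2
        visit[k] = sum(visit[l] * jump_prob(l, k) for l in range(k + 1, n + 1))
    return sum(k / rate(k) * visit[k] for k in range(2, n + 1))


def kingman_rate(k):
    """Total coalescence rate with k lineages under the standard coalescent."""
    return comb(k, 2)


def kingman_jump(l, k):
    """Under the standard coalescent every event takes l lineages to l - 1."""
    return 1.0 if k == l - 1 else 0.0


if __name__ == "__main__":
    print(expected_total_length(5, kingman_rate, kingman_jump))
```

For the standard coalescent every state is visited, so the result reduces to the familiar $2\sum_{j=1}^{n-1} 1/j$ in units where a pair coalesces at rate one. Multiple-merger models change only the two callables.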

We now turn to the expected total time in terminal branches, i.e. those branches on
which mutations will lead to singletons. Note that, since all lineages are exchangeable, $E[T_1]$
is equal to n times the mean time until a particular lineage - say, the first one - coalesces
with any other. To find this, let $S_{n,k}$ be the probability that at some point there are k
lineages, and that one of those k lineages is the original first lineage, still not coalesced with
any others. Then the mean time until the first lineage coalesces is
$\sum_{k=2}^{n} S_{n,k}/\lambda_k$, and hence

$$E[T_1] = n \sum_{k=2}^{n} \frac{1}{\lambda_k}\, S_{n,k}. \tag{31}$$

As above, we can get a solvable recursion for $S_{n,k}$ by conditioning on the last coalescent
event before reaching k lineages. If the coalescent process jumps from $\ell$ to k lineages, then
the probability that a given lineage is not part of this coalescent event is $(k - 1)/\ell$, and hence

$$S_{n,k} = \sum_{\ell=k+1}^{n} S_{n,\ell}\, p_{\ell,k}\, \frac{k-1}{\ell}, \qquad \text{for } k < n, \tag{32}$$

and $S_{n,n} = 1$. The recursion is also easily solvable, which lets us obtain $E[T_1]$.
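
The recursion for the terminal branches has the same shape, with an extra factor for the chance that the focal lineage escapes each merger. A minimal sketch follows, again with hypothetical function names and with the rates and jump probabilities supplied by the caller. The standard coalescent serves as the check, since its expected total terminal branch length is 2 (in units where a pair coalesces at rate one) for every sample size.

```python
from math import comb


def expected_terminal_length(n, rate, jump_prob):
    """E[T_1] = n * sum_{k=2}^n S[k] / rate(k), where S[n] = 1 and
    S[k] = sum_{l=k+1}^n S[l] * jump_prob(l, k) * (k - 1) / l for k < n."""
    uncoalesced = {n: 1.0}
    for k in range(n - 1, 1, -1):  # fill in k = n-1, ..., 2
        uncoalesced[k] = sum(
            uncoalesced[l] * jump_prob(l, k) * (k - 1) / l
            for l in range(k + 1, n + 1)
        )
    return n * sum(uncoalesced[k] / rate(k) for k in range(2, n + 1))


def kingman_rate(k):
    """Total coalescence rate with k lineages under the standard coalescent."""
    return comb(k, 2)


def kingman_jump(l, k):
    """Under the standard coalescent every event takes l lineages to l - 1."""
    return 1.0 if k == l - 1 else 0.0


if __name__ == "__main__":
    print(expected_terminal_length(6, kingman_rate, kingman_jump))
```

The factor $(k-1)/\ell$ is what distinguishes this recursion from the one for the visiting probabilities: a merger that takes $\ell$ lineages to $k$ involves $\ell - k + 1$ of them, so a given lineage avoids it with probability $1 - (\ell - k + 1)/\ell$.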
A.3 More on the low q limit

We would like to arrange things so that asymptotically, all coalescent events affect only two
lineages. We illustrate this limit by taking $\nu \to \infty$ and allowing $f(q)$ to depend on
$\nu$ in such a way that as $\nu \to \infty$, $I_{k,\ell}/I_{k,2} \to 0$ for all $3 \le \ell \le k$,
and that $\nu I_{k,2} \to \binom{k}{2}\gamma$ for some $0 < \gamma < \infty$. Since this model is a
Lambda coalescent with $\Lambda(dq) = q^2 \nu f(q)\, dq + \delta_0(dq)/2N$,
if we rescale time by a factor of C, a necessary and sufficient condition is that $C\Lambda$
converges weakly to a point mass at 0.

To emphasize the dependence of $f$ on $\nu$ we write $f(q) = f_\nu(q)$ and
$I_{k,\ell} = I_{k,\ell}(\nu)$. We would like to find a simple condition under which the proportion
of coalescences involving more than two lineages goes to zero, i.e. that
$I_{k,\ell}(\nu)/I_{k,2}(\nu) \to 0$ as $\nu \to \infty$ if $\ell > 2$. Fix k,
and suppose for convenience that $f_\nu(q) = 0$ for all $q > 1 - \epsilon$, for some
$\epsilon > 0$. Then

$$\epsilon^{k-\ell} \int_0^1 q^{\ell} f_\nu(q)\, dq \;\le\; \int_0^1 (1-q)^{k-\ell} q^{\ell} f_\nu(q)\, dq \;\le\; \int_0^1 q^{\ell} f_\nu(q)\, dq,$$

so that $I_{k,\ell}(\nu)/I_{k,2}(\nu) \to 0$ if and only if

$$\frac{\int_0^1 q^{\ell} f_\nu(q)\, dq}{\int_0^1 q^{2} f_\nu(q)\, dq} \to 0.$$

Using Jensen's inequality,



1 \^/2 



!lq^fM)dq Joq^fu{q)dq 

1 \ (^-2)/2 

(i^fM)dq 



so if $\int_0^1 q^2 f_\nu(q)\, dq \to 0$, this will be achieved. By the same result,

KDlol'fMdq ' 
so that, rescaling time by a factor $C_\nu$, if

$$\nu C_\nu \int_0^1 q^2 f_\nu(q)\, dq \to \gamma \qquad \text{as } \nu \to \infty,$$

then $\nu C_\nu I_{k,2} \to \binom{k}{2}\gamma$. In this limit, the rate at which a pair of lineages
coalesces converges, and does not depend on the number of lineages present.

Ideally, we would illustrate this with a stochastic model for X. However, the formula
requires the model to be analytically tractable to a degree satisfied by no population genetics
models that we could think of, and it is much easier to make a concrete choice of $f(q)$.
Consider the case where $f(q)$ is the density of a Beta(1, M) distribution. The mean of this
distribution is 1/(1 + M). In that case

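$$I_{k,\ell} = \binom{k}{\ell} \int_0^1 q^{\ell} (1-q)^{k-\ell}\, M (1-q)^{M-1}\, dq = \frac{M}{k+M}\, \binom{k}{\ell} \Big/ \binom{k+M-1}{\ell}, \tag{33}$$

so that as $M \to \infty$,

$$M^2 I_{k,2} = \binom{k}{2}\, \frac{M}{M+k} \cdot \frac{2M^2}{(M+k-1)(M+k-2)} \to 2\binom{k}{2},$$

so if $\nu C_\nu = M^2$, then $\gamma = 2$. We can furthermore check that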

$$\frac{I_{k,\ell}}{I_{k,2}} = \frac{\binom{k}{\ell}\, \ell!\, (k+M-\ell-1)!}{\binom{k}{2}\, 2!\, (k+M-3)!} \to 0, \tag{34}$$

so that this simple case satisfies our limit.
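
The Beta(1, M) example can be checked numerically. The sketch below assumes that $I_{k,\ell} = \binom{k}{\ell} \int_0^1 q^{\ell}(1-q)^{k-\ell} f(q)\, dq$, which is our reading of the definition used in this appendix, so that for the Beta(1, M) density the integral is $M$ times a Beta function. The function name is hypothetical.

```python
from math import comb, exp, lgamma


def beta_sweep_integral(k, l, M):
    """I_{k,l} = C(k,l) * M * B(l + 1, k - l + M) for f the Beta(1, M) density,
    evaluated through log-gamma functions to stay stable for large M."""
    log_beta = lgamma(l + 1) + lgamma(k - l + M) - lgamma(k + M + 1)
    return comb(k, l) * M * exp(log_beta)


if __name__ == "__main__":
    k = 5
    for M in (10, 1000, 100000):
        triple_vs_pair = beta_sweep_integral(k, 3, M) / beta_sweep_integral(k, 2, M)
        scaled_pair = M ** 2 * beta_sweep_integral(k, 2, M) / comb(k, 2)
        print(M, triple_vs_pair, scaled_pair)
```

As M grows, the ratio of triple to pairwise mergers shrinks like $1/M$, while $M^2 I_{k,2}/\binom{k}{2}$ settles near 2. Together these show that multiple mergers become negligible and that the pairwise rate converges after rescaling.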
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